Abstract



We address the problem of answering Web ontology queries efficiently. An ontology is 
formalized as a Deductive Ontology Base (DOB), a deductive database that comprises 
the ontology's inference axioms and facts. A cost-based query optimization technique for 
DOB is presented. A hybrid cost model is proposed to estimate the cost and cardinality 
of basic and inferred facts. Cardinality and cost of inferred facts are estimated using an 
adaptive sampling technique, while techniques of traditional relational cost models are 
used for estimating the cost of basic facts and conjunctive ontology queries. Finally, we
implement a dynamic-programming optimization algorithm to identify query evaluation
plans that minimize the number of intermediate inferred facts. We modeled a subset of
the Web ontology language OWL Lite as a DOB, and performed an experimental study to
analyze the predictive capacity of our cost model and the benefits of the query optimization
technique. Our study has been conducted over synthetic and real-world OWL ontologies,
and shows that the techniques are accurate and improve query performance. To appear in
Theory and Practice of Logic Programming (TPLP).



1 Introduction

Ontology systems usually provide reasoning and retrieval services that identify the 
basic facts that satisfy a requirement, and derive implicit knowledge using the 
ontology's inference axioms. In the context of the Semantic Web, the number of 
inferred facts can be extremely large. On one hand, the number of basic ontology
facts (domain concepts and Web source annotations) can be considerable, and on
the other hand, Open World reasoning in Web ontologies may yield a large space
of choices. Therefore, efficient evaluation strategies are needed in Web ontology
inference engines.

In our approach, ontologies are formalized as a deductive database called a De- 
ductive Ontology Base (DOB). The extensional database comprises all the ontology 
language's statements that represent the explicit ontology knowledge. The inten- 
sional database corresponds to the set of deductive rules which define the semantics 
of the ontology language. We provide a cost-based optimization technique for Web
ontologies represented as a DOB. 

Traditional query optimization techniques for deductive database systems include
join-ordering strategies, and techniques that combine a bottom-up evaluation
with top-down propagation of query variable bindings in the spirit of the Magic-Sets
algorithm (Ramakrishnan and Ullman 1993). Join-ordering strategies may be
heuristic-based or cost-based; some cost-based approaches depend on the estimation
of the join selectivity; others rely on the fan-out of a literal (Staudt et al. 1999).



Cost-based query optimization has been successfully used by relational database 
management systems; however, these optimizers are not able to estimate the cost 
or cardinality of data that do not exist a priori, which is the case of intensional 
predicates in a DOB. 

We propose a hybrid cost model that combines two techniques for cardinality
and cost estimation: (1) the sampling technique proposed in (Lipton and Naughton
1990; Lipton et al. 1990) is applied for the estimation of the evaluation cost and
cardinality of intensional predicates, and (2) a cost model a la System R
is used for the estimation of the cost and cardinality of extensional predicates and
the cost of conjunctive queries.

Three evaluation strategies are considered for "joining" predicates in conjunctive
queries. They are based on the Nested-Loop, Block Nested-Loop, and Hash Join
operators of relational databases (Ramakrishnan and Gehrke 2003). To identify a
good evaluation plan, we provide a dynamic-programming optimization algorithm
that orders subgoals in a query, considering estimates of the subgoals' evaluation
cost.



We modeled a subset of the Web ontology language OWL Lite (McGuinness
and Harmelen 2004) as a DOB, and performed experiments to study the predictive
capacity of the cost model and the benefits of the ontology query optimization
techniques. The study has been conducted over synthetic and real-world OWL
ontologies. Preliminary results show that the cost-model estimates are accurate
and that optimized queries are significantly less expensive than non-optimized ones.
Our current formalism does not represent the OWL built-in constructor ComplementOf.
We stress that in practice this is not a severe limitation. For example, this
operator is not used in any of the three real-world ontologies that we have studied
in our experiments; and in the survey reported in (Wang 2006), only 21 ontologies
out of 688 contain this constructor.

Our work differs from other systems in the Semantic Web that combine a Description
Logics (DL) reasoner with a relational DBMS in order to solve the scalability
problems for reasoning with individuals (Calvanese et al. 2005; Haarslev and Möller
2004; Horrocks and Turi 2005; Pan and Heflin 2003). Clearly, all of these systems
use the query optimization component embedded in the relational DBMS; however,
they do not develop cost-based optimization for the implicit knowledge, that
is, there is no estimation of the cost of data not known a priori.

Other systems use Logic Programming (LP) to reason on large-scale ontologies.
This is the case of the projects described in (Grosof et al. 2003; Hustadt and Motik
2005; Motik et al. 2003). In Description Logic Programs (DLP) (Grosof et al. 2003),
the expressive intersection between DL and LP without function symbols is defined.
DL queries are reduced to LP queries and efficient LP algorithms are explored. The
project described in (Hustadt and Motik 2005; Motik et al. 2003) reduces a SHIQ
knowledge base to a Disjunctive Datalog program. Both projects apply Magic-Sets
rewriting techniques but, to the best of our knowledge, no cost-based optimization
techniques have been developed. The OWL Lite⁻ species of the OWL language
proposed in (Bruijn et al. 2004) is based on the DLP project; it corresponds to
the portion of the OWL Lite language that can be translated to Datalog. All of
these systems develop LP reasoning with individuals, whereas in the DOB model
we develop Datalog reasoning with both domain concepts and individuals.



In (Eiter et al. 2006), an efficient bottom-up evaluation strategy for HEX-programs
based on the theory of splitting sets is described. In the context of the Semantic 
Web, these non-monotonic logic programs contain higher-order atoms and exter- 
nal atoms that may represent RDF and OWL knowledge. However, their approach 
does not include determining the best evaluation strategy according to a certain 
cost metric. 

In the next section we describe our DOB formalism. Following this, we describe 
the DOB-S System architecture. Then, we model a subset of OWL Lite as a DOB
and present a motivating example. Next, we develop our hybrid cost model and 
query optimization algorithm. We describe our experimental study and, finally, we 
point out our conclusions and future work. 



2 The Deductive Ontology Base (DOB) 

In general, an ontology knowledge base can be defined as: 

Definition 1 (Ontology Knowledge Base)

An ontology knowledge base O is a pair $O = \langle \mathcal{F}, \mathcal{I} \rangle$, where $\mathcal{F}$ is a set of ontology
facts that represent the explicit ontology structure (domain) and source annotations
(individuals), and $\mathcal{I}$ is a set of axioms that allow the inference of new ontology facts
regarding both domain and individuals.

We will model O as a deductive database which we call a Deductive Ontology
Base (DOB). A DOB is composed of an Extensional Ontology Base (EOB) and an
Intensional Ontology Base (IOB). Formally, a DOB is defined as:

Definition 2 (DOB)

Given an ontology knowledge base $O = \langle \mathcal{F}, \mathcal{I} \rangle$, a DOB is a deductive database
composed of a set of built-in EOB ground predicates representing $\mathcal{F}$ and a set of
IOB built-in predicates representing $\mathcal{I}$, i.e., predicates that define the semantics of the EOB
built-in predicates.

The IOB predicate and DOB query definitions follow the Datalog language
formalism (Abiteboul et al. 1995). Next, we provide the definitions related to
query-answering for DOBs.

Definition 3 (Valid Instantiation)

Given a Deductive Ontology Base O, a set of constants C in O, a set of variables
V, a rule R, and an interpretation I of O that corresponds to its Minimal Perfect
Model (Abiteboul et al. 1995), a valuation¹ $\gamma$ is a valid instantiation of R if and
only if $\gamma(R)$ evaluates to true in I.

Definition 4 (Intermediate Inferred Facts)

Given a Deductive Ontology Base O and a query $q : Q(X) \leftarrow \exists Y\, B(X,Y)$, a
proof tree for q with respect to O is defined as follows:

• Each node in the tree is labeled by a predicate in O.

• Each leaf in the tree is labeled by a predicate in O's EOB.

• The root of the tree is labeled by Q.

• For each internal node N including the root, if N is labeled by a predicate A
defined by the rule $R : A(X) \leftarrow \exists Y\, C(X,Y)$, where $C(X,Y)$ is the conjunction
of the predicates $C_1, \ldots, C_n$, then, for each valid instantiation $\gamma$ of R, the
node N has a sub-tree whose root is $\gamma(A(X))$ and whose children are respectively
labeled $\gamma(C_1), \ldots, \gamma(C_n)$.

The valuations needed to define all the valid instantiations in the proof tree
correspond to the Intermediate Inferred Facts of q.

The number of intermediate inferred facts measures the evaluation cost of the 
query Q. Additionally, since the valid instantiations of Q in the proof tree corres- 
pond to the answers of the query, the cardinality of Q corresponds to the number 
of such instantiations. 

Note that the sets of EOB and IOB built-in predicates of a DOB define an
ontology framework, so our model is not tied to any particular ontology language. 
To illustrate the use of our approach we focus on OWL Lite ontologies. 

3 The DOB-S System's Architecture 

DOB-S is a system that allows an agent to pose efficient conjunctive queries against
a set of ontologies. The system's architecture can be seen in Figure 1.

A subset of a given OWL ontology is translated into a DOB using an OWL Lite
to DOB translator. EOB and IOB predicates are stored as a deductive database.
Next, an analyzer generates the ontology's statistics: for each EOB predicate,
the analyzer computes the number of facts or valid instantiations in the DOB
(cardinality), and the number of different values for each of its arguments (nKeys);
for each IOB predicate, an adaptive sampling algorithm (Lipton and Naughton
1990) is applied to compute cardinality and cost estimates.
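The EOB half of the analyzer's statistics can be illustrated with a minimal sketch. It assumes facts are given as (predicate name, argument tuple) pairs, that all facts of a predicate share the same arity, and that facts are distinct; the function name `eob_statistics` and the data layout are hypothetical and not part of DOB-S.

```python
from collections import defaultdict

def eob_statistics(facts):
    """Compute cardinality and nKeys per EOB predicate.

    facts: iterable of (predicate_name, argument_tuple) pairs (assumed layout).
    """
    grouped = defaultdict(list)
    for name, args in facts:
        grouped[name].append(args)
    stats = {}
    for name, rows in grouped.items():
        arity = len(rows[0])
        # nKeys: number of different values seen in each argument position
        n_keys = [len({row[i] for row in rows}) for i in range(arity)]
        stats[name] = {"cardinality": len(rows), "nKeys": n_keys}
    return stats

# Hypothetical EOB facts in the style of the paper's cars example
facts = [
    ("isDProperty", ("price", "vehicle")),
    ("isDProperty", ("model", "vehicle")),
    ("isDProperty", ("traction", "suv")),
]
stats = eob_statistics(facts)
```

Here `stats["isDProperty"]` holds a cardinality of 3 and nKeys of 3 and 2 for the property and domain arguments.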

When an agent formulates a conjunctive query, the DOB-S system's optimizer
generates an efficient query evaluation plan. A dynamic-programming optimizer is
based on a hybrid cost model: it uses the ontology's EOB and IOB statistics,
and estimates the cost of a query according to the different evaluation strategies
implemented. Finally, an execution engine evaluates the query plan and produces
a query answer.

¹ Given a set of variables V and a set of constants C, a mapping or valuation $\gamma$ is a function
$\gamma : V \to C$.

Fig. 1. DOB-S System Architecture

4 OWL Lite DOB 

An OWL Lite ontology contains: (1) a set of axioms that provides information 
about classes and properties, and (2) a set of facts that represents individuals in 
the ontology, the classes they belong to, and the properties they participate in. 

Restrictions allow the construction of class definitions by restricting the values of 
their properties and their cardinality. Classes may also be defined through the inter- 
section of other classes. Object properties represent binary relationships between 
individuals; datatype properties correspond to relationships between individuals 
and data values belonging to primitive datatypes. 

The subset of OWL Lite represented as a DOB does not include domain and
range class intersection. Also, primitive datatypes are not handled; therefore, we
do not represent ranges for Datatype properties.²

² EquivalentClasses, EquivalentProperties, and allDifferent axioms, and the
cardinality restriction are not represented because they are syntactic sugar for other
language constructs.
4.1 OWL Lite DOB Syntax

Our formalism, DOB, provides a set of EOB built-in predicates that represents all 
the axioms and restrictions of an OWL Lite subset. 

EOB predicates are ground, i.e., no variables are allowed as arguments. A set of
IOB built-in predicates represents the semantics of the EOB predicates. We have
followed the OWL Web Ontology Language Overview presented in (McGuinness
and Harmelen 2004).

Table 1 illustrates the EOB and IOB built-in predicates for an OWL Lite subset.³
Note that some predicates refer to domain concepts (e.g. isClass, areClasses), and
some to instance concepts (e.g. isIndividual, areIndividuals).



Table 1. Some built-in EOB and IOB Predicates for a subset of OWL Lite

EOB PREDICATE            DESCRIPTION

isOntology(O)            An ontology has a URI O
isImpOntology(O1,O2)     Ontology O1 imports ontology O2
isClass(C,O)             C is a class in ontology O
isOProperty(P,D,R)       P is an object property with domain D and range R
isDProperty(P,D)         P is a datatype property with domain D
isTransitive(P)          P is a transitive property
subClassOf(C1,C2)        C1 is subclass of C2
AllValuesFrom(C,P,D)     C has property P with all values in D
isIndividual(I,C)        I is an individual belonging to class C
isStatement(I,P,J)       I is an individual that has property P with value J

IOB PREDICATE            DESCRIPTION

areSubClasses(C1,C2)     C1 are the direct and indirect subclasses of C2
areImpOntologies(O1,O2)  O1 import the ontologies O2 directly and indirectly
areClasses(C,O)          C are all the classes of an ontology O and its imported ontologies
areIndividuals(I,C)      I are the individuals of a class and all of its direct and indirect
                         superclasses C; or
                         I are the individuals that participate in a property and belong to
                         its domain or range C, or are values of a property with all values in C



4.2 OWL Lite DOB Semantics 

A model-theoretic semantics for an OWL Lite (subset) DOB is as follows: 



³ We assume that the class owl:Thing is the default value for the domain and range of a property.
Table 2. Mapping OWL Lite subset to EOB Predicates

OWL ABSTRACT SYNTAX                                     EOB PREDICATES

Ontology(O)                                             isOntology(O)
Individual(O1 value(owl:imports O2))                    impOntology(O1,O2)
Ontology(O) ∧ Class(C partial Thing)                    isClass(C,O)
Class(A partial C)                                      subClassOf(A,C)
Class(C1 partial restriction(P allValuesFrom(C2)))      allValuesFrom(C1,P,C2)
Class(A partial C1 ... Cn)                              subClassOf(A,C1), ..., subClassOf(A,Cn)
ObjectProperty(P domain(D)), ObjectProperty(P range(R)) isOProperty(P,D,R)
DatatypeProperty(P domain(D))                           isDProperty(P,D)
Property(P Transitive)                                  isTransitive(P)
Individual(I type(C))                                   isIndividual(I,C)
Individual(I value(P J))                                isStatement(I,P,J)



Table 3. Mapping OWL Lite subset Inference Rules to IOB Predicates

Each OWL LITE INFERENCE RULE is followed by its IOB RULE DEFINITIONS.

If subClassOf(C1,C2) and subClassOf(C2,C3) then subClassOf(C1,C3)
    areSubClasses(C1,C2) :- subClassOf(C1,C2).
    areSubClasses(C1,C2) :- subClassOf(C1,C3), areSubClasses(C3,C2).

If impOntology(O1,O2) and impOntology(O2,O3) then impOntology(O1,O3)
    areImpOntologies(O1,O2) :- impOntology(O1,O2).
    areImpOntologies(O1,O2) :- impOntology(O1,O3), areImpOntologies(O3,O2).

If isClass(C1,O2) and impOntology(O1,O2) then isClass(C1,O1)
    areClasses(C,O) :- isClass(C,O).
    areClasses(C,O1) :- isClass(C,O2), areImpOntologies(O1,O2).

If isSubClassOf(C1,C2) and isIndividual(I,C1) then isIndividual(I,C2)
    areIndividuals(I,C) :- isIndividual(I,C).
    areIndividuals(I,C2) :- isIndividual(I,C1), areSubClasses(C1,C2).

If isStatement(I,P,J) and isOProperty(P,C,R) then isIndividual(I,C)
    areIndividuals(I,C) :- isOProperty(P,C,R), areStatements(I,P,J).

If isStatement(I,P,J) and isOProperty(P,D,C) then isIndividual(J,C)
    areIndividuals(J,C) :- isOProperty(P,D,C), areStatements(I,P,J).

If isStatement(I,P,J) and isDProperty(P,C) then isIndividual(I,C)
    areIndividuals(I,C) :- isDProperty(P,C), areStatements(I,P,J).

If AllValues(C1,P,C) and isStatement(I,P,J) and isIndividual(I,C1) then isIndividual(J,C)
    areIndividuals(J,C) :- isIndividual(I,C1), allValuesFrom(C1,P,C), areStatements(I,P,J).
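To make the recursive IOB rules of Table 3 concrete, the following Python sketch evaluates the areSubClasses pair of rules and the first two areIndividuals rules by a naive bottom-up fixpoint. This is only an illustration: DOB-S itself is implemented in SWI-Prolog, the helper name `transitive_closure` is hypothetical, and the fact `subClassOf(vehicle, thing)` is an assumed example fact.

```python
def transitive_closure(edges):
    """Naive bottom-up fixpoint for the two areSubClasses rules."""
    closure = set(edges)                       # areSubClasses(C1,C2) :- subClassOf(C1,C2).
    changed = True
    while changed:
        changed = False
        for (c1, c3) in edges:                 # subClassOf(C1,C3)
            for (c3b, c2) in list(closure):    # areSubClasses(C3,C2)
                if c3 == c3b and (c1, c2) not in closure:
                    closure.add((c1, c2))
                    changed = True
    return closure

# Assumed EOB facts (illustrative only)
sub_class_of = {("suv", "vehicle"), ("vehicle", "thing")}
is_individual = {("s123", "suv")}

are_sub_classes = transitive_closure(sub_class_of)

# areIndividuals(I,C)  :- isIndividual(I,C).
# areIndividuals(I,C2) :- isIndividual(I,C1), areSubClasses(C1,C2).
are_individuals = set(is_individual) | {
    (i, c2)
    for (i, c1) in is_individual
    for (c1b, c2) in are_sub_classes
    if c1 == c1b
}
```

With these facts, s123 is inferred to be an individual of suv, vehicle, and thing; the pairs in `are_sub_classes` and `are_individuals` that are not base facts are exactly the inferred ones.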



Definition 5 (Interpretation)

An Interpretation $I = (\Delta^I, \mathcal{P}^I, \cdot^I)$ consists of:

• A non-empty interpretation domain $\Delta^I$ corresponding to the union of the
sets of valid URIs of ontologies, classes, object and datatype properties, and
individuals. These sets are pairwise disjoint.

• A set of interpretations $\mathcal{P}^I$ of the EOB and IOB built-in predicates in Table 1.

• An interpretation function $\cdot^I$ which maps each n-ary built-in predicate $p^I \in
\mathcal{P}^I$ to an n-ary relation $\prod_{i=1}^{n} \Delta^I$.

Definition 6 (Satisfiability)

Given an OWL Lite DOB $\mathcal{D}$, an interpretation I, and a predicate $p \in \mathcal{D}$, $I \models p$ iff:

• p is an EOB predicate $p(t_1, \ldots, t_n)$ and $(t_1, \ldots, t_n) \in p^I$.

• p is an IOB predicate $R : H(X) \leftarrow \exists Y\, B(X,Y)$, and whenever I satisfies each
predicate in the body B, I also satisfies the predicate in the head H.

Definition 7 (Model)

Given an OWL Lite DOB $\mathcal{D}$ and an interpretation I, I is a model of $\mathcal{D}$ iff for every
predicate $p \in \mathcal{D}$, $I \models p$.



4.3 Translation of OWL Lite to OWL Lite DOB 

A translation map from OWL Lite to OWL Lite DOB is defined as follows:

Definition 8 (Translation)

Given an OWL Lite theory $\mathcal{O}$ and an OWL Lite DOB theory $\mathcal{D}$, an OWL Lite to
DOB Translation T is a function $T : \mathcal{O} \to \mathcal{D}$.

Given an OWL Lite ontology $\mathcal{O}$, an OWL Lite DOB ontology $\mathcal{D}$ is defined as follows:

• (Base Case) If o is an axiom or fact belonging to the sets of axioms or facts
of $\mathcal{O}$, then an EOB predicate T(o) is defined according to the EOB mappings
in Table 2.

• If o is an OWL Lite inference rule, then an IOB predicate T(o) is defined
according to the IOB mappings in Table 3.

The translation ensures that the following theorem holds:

Theorem 1

Let $\mathcal{O}$ and $\mathcal{D}$ be OWL Lite and OWL Lite DOB theories respectively, and T be an
OWL Lite to DOB Translation such that $T(\mathcal{O}) = \mathcal{D}$; then $\mathcal{D} \models \mathcal{O}$.

5 A Motivating Example 

Consider a 'cars and dealers' domain ontology carsOnt and Web source ontologies
source1 and source2. Source source1 publishes information about all types of
vehicles and dealers, whereas source2 is specialized in SUVs.

The OWL Lite ontologies can be seen in Table 4.

A portion of the example's EOB can be seen in Table 5.

To illustrate a rule evaluation, we will take a query q that asks for the Web
sources that publish information about 'traction':
Table 4. Example OWL Lite ontology

Ontology carsOnt
    Class(vehicle partial Thing)
    Class(suv partial vehicle)
    Class(car partial vehicle)
    DataProperty(price domain(vehicle))
    Class(dealer partial Thing)
    ObjectProperty(sells domain(dealer))
    ObjectProperty(sells range(vehicle))
    DataProperty(traction domain(suv))
    DataProperty(model domain(vehicle))

Ontology source1
    imports carsOnt

Ontology source2
    imports carsOnt
    individual(s123 type(suv))


Table 5. Example DOB ontology

EOB PREDICATES

isOntology(carsOnt)                  isOntology(source1)
isOntology(source2)                  impOntology(source1,carsOnt)
impOntology(source2,carsOnt)         isClass(vehicle,carsOnt)
isClass(dealer,carsOnt)              subClassOf(suv,vehicle)
subClassOf(car,vehicle)              isDProperty(price,vehicle)
isOProperty(sells,dealer,vehicle)    isDProperty(model,vehicle)
isDProperty(traction,suv)            isIndividual(s123,suv)



q(O) :- areClasses(C,O), isDProperty(traction,C).

The answer to this query corresponds to all the ontologies with classes characterized
by the property traction, i.e., ontologies source1, source2 and carsOnt.

If we invert the ordering of the first two predicates in q, we will have an equivalent
query q':

q'(O) :- isDProperty(traction,C), areClasses(C,O).

The cost, i.e., the total number of inferred facts, for q is larger than the cost for q'.
In q, the number of instantiations or cardinality for the first intensional predicate
areClasses(C,O) is twelve, four for each ontology, as source1 and source2 inherit
the classes in carsOnt. The cost of inferring these facts depends on the cost of
evaluating the areClasses rule. In q', for the first subgoal isDProperty(traction,C),
we have one instantiation: isDProperty(traction,suv). Again, the cost of inferring
this fact depends on the cost of the isDProperty predicate.

Note that statistics on the size and argument values of the EOB isDProperty
predicate can be computed, whereas statistics for the IOB areClasses predicate
will have to be estimated, as the data is not known a priori. Once the cost of each
query predicate is determined, we may apply a cost-based join-ordering optimization
strategy.
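The counts in this example can be reproduced with a small Python sketch. The isClass facts for suv and car are assumptions taken from the class declarations of the example ontology (the EOB table shows only a portion of the facts), and the helper `closure` is a hypothetical stand-in for the areImpOntologies rules.

```python
def closure(pairs):
    """Transitive closure of a binary relation (stand-in for areImpOntologies)."""
    result = set(pairs)
    while True:
        new = {(a, c) for (a, b) in pairs for (b2, c) in result if b == b2} - result
        if not new:
            return result
        result |= new

# EOB facts of the example; isClass facts for suv and car are assumed
is_class = {("vehicle", "carsOnt"), ("suv", "carsOnt"),
            ("car", "carsOnt"), ("dealer", "carsOnt")}
imp_ontology = {("source1", "carsOnt"), ("source2", "carsOnt")}
is_dproperty = {("price", "vehicle"), ("model", "vehicle"), ("traction", "suv")}

are_imp_ontologies = closure(imp_ontology)

# areClasses(C,O)  :- isClass(C,O).
# areClasses(C,O1) :- isClass(C,O2), areImpOntologies(O1,O2).
are_classes = set(is_class) | {
    (c, o1)
    for (c, o2) in is_class
    for (o1, o2b) in are_imp_ontologies
    if o2 == o2b
}

# First subgoal of q: every instantiation of areClasses(C,O)
first_subgoal_q = len(are_classes)

# First subgoal of q': instantiations of isDProperty(traction,C)
traction_classes = {c for (p, c) in is_dproperty if p == "traction"}
first_subgoal_q_prime = len(traction_classes)

# Both orderings produce the same answers
answers = {o for (c, o) in are_classes if c in traction_classes}
```

Under these assumptions the first subgoal of q yields twelve instantiations while the first subgoal of q' yields one, and both orderings return the same three ontologies.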



6 DOB Hybrid Cost Model 

The process of answering a query relies on inferring facts from the predicates in 
the DOB. Our cost metric is focused on the number of intermediate facts that 
need to be inferred in order to answer the query. The objective is to find an order 
of the predicates in the body of the query, such that the number of intermediate 
inferred facts is reduced. We will apply a join-ordering optimization strategy a la
System R using Datalog-relational equivalences (Abiteboul et al. 1995). To estimate
the cardinality and evaluation cost of the intensional predicates, we have applied an 
adaptive sampling technique. Thus, we propose a hybrid cost model which combines 
adaptive sampling and traditional relational cost models. 



6.1 Adaptive Sampling Technique 

We have developed a sampling technique that is based on the adaptive sampling
method proposed by Lipton, Naughton, and Schneider (Lipton and Naughton 1990;
Lipton et al. 1990). This technique assumes that there is a population P of all
the different valid instantiations of a predicate P, and that P is divided into n 
partitions according to the n possible instantiations of one or more arguments of 
P. Each element in P is related to its evaluation cost and cardinality, and the 
population P is characterized by the statistics mean and variance. 

The objective of the sampling is to identify a sample of the population P, called 
EP, such that the mean and variance of the cardinality (resp. evaluation cost) of 
EP are valid to within a predetermined accuracy and confidence level. 

To estimate the mean of the cardinality (resp. cost) of EP, say $\bar{Y}$, within $\bar{Y}/d$ with
probability p, where $0 \le p < 1$ and $d > 0$, the sampling method assumes an urn
model.

The urn has n balls from which m samplings are repeatedly taken, until the sum
z of the cardinalities (resp. costs) of the samples is greater than $\alpha \times (S/\bar{Y})$, where
$\alpha = \frac{d(d+1)}{1-\sqrt{p}}$. The estimated mean of the cardinality (resp. cost) is $\bar{Y} = z/m$.

The values d and $\frac{1}{1-\sqrt{p}}$ are associated with the relative error and the confidence
level, and S and $\bar{Y}$ represent the cardinality (resp. cost) variance and mean of P.
Since statistics of P are unknown, the upper bound $\alpha \times (S/\bar{Y})$ is replaced by $\alpha \times b(n)$.

To approximate b(n) for cost and cardinality estimates, we apply Double Sampling
(Ling and Sun 1992). In the first stage we randomly evaluate k samples and
take the maximum value among them:

$b(n) = \max_{i=1}^{k} card(P_i)$ (resp. $b(n) = \max_{i=1}^{k} cost(P_i)$), where $1 \le k \le n$

It has been shown that a few samples are necessary in order for the distribution
of the sum to begin to look normal. Thus, the factor $1/(1-\sqrt{p})$ may be improved by
the central limit theorem (Lipton et al. 1990). This improvement allows us to achieve
accurate estimations and lower bounds.
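The two-stage procedure can be sketched in Python as follows. This is a hedged illustration rather than the DOB-S implementation: it assumes the stopping constant $\alpha = d(d+1)/(1-\sqrt{p})$, sampling with replacement, and a hypothetical `max_samples` guard that is not in the original method; `partition_value` is a hypothetical callback that evaluates one partition and returns its cardinality (or cost).

```python
import math
import random

def adaptive_sample_mean(partition_value, n, d=0.2, p=0.7, k=7,
                         rng=None, max_samples=10_000):
    """Estimate the mean cardinality (or cost) per partition.

    partition_value(i) evaluates partition i in range(n); assumed callback.
    """
    rng = rng or random.Random(0)
    alpha = d * (d + 1) / (1 - math.sqrt(p))      # assumed stopping constant
    # Stage 1 (double sampling): approximate b(n) by the maximum of k samples.
    first_stage = [partition_value(rng.randrange(n)) for _ in range(min(k, n))]
    b = max(first_stage)
    # Stage 2: keep sampling until the running sum z exceeds alpha * b(n).
    z, m = 0.0, 0
    while z <= alpha * b and m < max_samples:     # max_samples is a safety guard
        z += partition_value(rng.randrange(n))
        m += 1
    return z / m

# Hypothetical population: 3 ontologies (partitions) with 4 classes each
classes_per_ontology = [4, 4, 4]
mean = adaptive_sample_mean(lambda i: classes_per_ontology[i], n=3)
card_estimate = mean * 3      # cardinality estimate = estimated mean x n
```

With the sampling parameters used later in the experiments (d = 0.2, p = 0.7, k = 7) and this uniform toy population, the second stage stops after two samples and the cardinality estimate equals the true value of twelve.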

6.1.1 Estimating cardinality. 

Given an intensional predicate P, the cardinality of P corresponds to the number
of valid instantiations of P (Definition 3). In our previous example, the number of
ontology values obtained in the answer of the query is estimated using this metric.

To estimate the cardinality of P, we execute the adaptive sampling algorithm
explained before, by selecting any argument of P, and partitioning P according to
the chosen argument. The cardinality estimation will be $card(P) = \bar{Y} \times n$, where n
is the number of partitions, i.e. the number of different instantiations for the chosen
argument.

Note that once the cardinality of the non-instantiated P is estimated, we can
estimate the cardinality of the instantiated predicate by using the selectivity value(s)
of the instantiated argument(s).

6.1.2 Estimating cost. 

The cost of P measures the number of intermediate inferred facts (Definition 4).
For instance, to estimate the cost of a predicate P(X,Y), we consider the different
instantiation patterns that the predicate can have, i.e., we independently estimate
the cost for $P(X^b,Y^b)$, $P(X^b,Y^f)$, $P(X^f,Y^b)$ and $P(X^f,Y^f)$, where b and f indicate
that the argument is bound and free, respectively.

The computation of several cost estimates is necessary because in Datalog
top-down evaluation (Abiteboul et al. 1995), the cost of an instantiated intensional
predicate cannot be accurately estimated from the cost of a non-instantiated
predicate (using selectivity values). Instantiated arguments will propagate in the IOB
rule's body through sideways-passing, and cost varies according to the binding
patterns. For example, the cost of $areSubClasses(C1^b,C2^f)$ may be smaller than the cost
of $areSubClasses(C1^f,C2^b)$, i.e., the bound argument C1 "pushes" instantiations in the
definition of the rule:

areSubClasses(C1,C2) :- subClassOf(C1,C3), areSubClasses(C3,C2).

making its body predicates more selective.

For $P(X^b,Y^b)$, $P(X^b,Y^f)$ and $P(X^f,Y^b)$, we partition P according to the bound
arguments. In these cases we are estimating the cost of one partition. Therefore,
$cost(P) = z/m = \bar{Y}$.

Finally, to estimate the cost of $P(X^f,Y^f)$, we choose an argument of P and
partition P according to the chosen argument. To reduce the cost of computing the
estimate, we choose the most selective argument. The cost estimate is $cost(P) =
\bar{Y} \times n$.

6.1.3 Determining the number of partitions n. 

For both cost and cardinality estimates, we need to determine the number of possible
instantiations, n, of the chosen argument. This value depends on the semantics
of the particular predicate. For instance, for an interpretation I,

$areClasses(Class,Ont)^I \subseteq C \times O$

where C is the set of valid class URIs and O is the set of valid ontology URIs. |C|
corresponds to the number of EOB predicates isClass(Class,Ont), i.e.,

$|C| = Card(isClass(Class,Ont))$

Similarly, $|O| = Card(isOntology(Ont))$; these cardinalities are pre-computed offline.
We assume that the values are uniformly distributed.

6.2 System R Technique 

To estimate the cardinality and cost of two or more predicates, we use the cost
model proposed in System R. The cardinality of the conjunction of predicates $P_1, P_2$
is described by the following expression:

$card(P_1, P_2) = card(P_1) \times card(P_2) \times reductionFactor(P_1, P_2)$

$reductionFactor(P_1, P_2)$ reflects the impact of the sideways passing variables in
reducing the cardinality of the result. This value is computed assuming that sideways
passing variables are independent and each is uniformly distributed (Selinger et al.
1979). For cost estimation, we consider three evaluation strategies:



1. Nested-Loop Join

Following a Nested-Loop Join evaluation strategy, for each valid instantiation
in $P_1$, we retrieve a valid instantiation in $P_2$ with a matching "join" argument
value:

$cost(P_1, P_2) = cost(P_1) + card(P_1) \times cost^{inst}(P_2)$

$cost^{inst}(P_2)$ corresponds to the estimate of the cost of the predicate $P_2$ where
the "join" arguments are instantiated in $P_2$, i.e., all the sideways passing
variables from $P_1$ to $P_2$ are bound in $P_2$. These binding patterns were considered
during the sampling-based estimation of the cost of $P_2$.

2. Block Nested-Loop Join

Predicate $P_1$ is evaluated into blocks of fixed size, and then each block is
"joined" with $P_2$:

$cost(P_1, P_2) = cost(P_1) + \lceil card(P_1) / BlockSize \rceil \times cost(P_2)$

3. Hash Join

A hash table is built for each predicate according to its join argument. The
valid instantiations of predicates $P_1$ and $P_2$ with the same hash key will be
joined together:

$cost(P_1, P_2) = cost(P_1) + cost(P_2)$
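The cardinality expression and the three cost formulas above translate directly into code. The sketch below is a plain transcription under the stated formulas; the function names and the sample numbers are hypothetical and are chosen only to show how the strategies can be compared for one pair of predicates.

```python
import math

def join_cardinality(card1, card2, reduction_factor):
    # card(P1,P2) = card(P1) x card(P2) x reductionFactor(P1,P2)
    return card1 * card2 * reduction_factor

def nested_loop_cost(cost1, card1, cost2_bound):
    # cost2_bound: cost of P2 with its join arguments instantiated
    return cost1 + card1 * cost2_bound

def block_nested_loop_cost(cost1, card1, cost2, block_size):
    # one evaluation of P2 per block of P1 instantiations
    return cost1 + math.ceil(card1 / block_size) * cost2

def hash_join_cost(cost1, cost2):
    return cost1 + cost2

# Hypothetical estimates for a pair of predicates P1, P2
cost1, card1 = 10, 12
cost2_free, cost2_bound = 30, 3

costs = {
    "nested_loop": nested_loop_cost(cost1, card1, cost2_bound),
    "block_nested_loop": block_nested_loop_cost(cost1, card1, cost2_free, block_size=5),
    "hash_join": hash_join_cost(cost1, cost2_free),
}
cheapest = min(costs, key=costs.get)
```

For these made-up numbers the hash join is cheapest; with a more selective bound pattern for the second predicate, the nested-loop strategy would win instead.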

Although the sampling technique is appropriate for estimating a single predicate,
it may be inefficient for estimating the size of a conjunction of more than two
predicates.

The sampling algorithm in (Lipton and Naughton 1990) suggests that for a
conjunction of two predicates, P, Q, if the size of P is n, the query is n-partitionable,
i.e., for each valid instantiation p in P, the corresponding partition of Q contains
all the valid instantiations q in Q such that q "joins" p. Therefore, when the size of
the first predicate in a query is small, its sample size may be larger. This problem
can be extended to conjunctive queries with several subgoals, so when the number
of intermediate results is small, sampling time may be as large as evaluation time.

Table 6. Query Optimization Algorithm

Algorithm Dynamic Programming

INPUT: Predicate: a set of predicates, $P_1, \ldots, P_n$.
OUTPUT: OrderedPredicate: an ordering of Predicate.

1. SubPaths = Predicate;

2. For i = 1 to n

(a) For each solution $Sub_j$ in SubPaths

i. For each predicate $P_z$ in Predicate

• If there are sideways passing variables from $Sub_j$ to $P_z$,
then add $Sub = Sub_j \bowtie P_z$ to NewSubPaths

(b) Remove from NewSubPaths any subpath $Sub_k$ iff there is another subpath $Sub_i$ in
NewSubPaths, such that $Sub_i$ and $Sub_k$ are equivalent, and $Sub_i$ is better than $Sub_k$.

(c) SubPaths = NewSubPaths

(d) Reset NewSubPaths

3. Return the path in SubPaths with lowest cost.



6.3 Query Optimization 

In Table 6 we present the algorithm used to optimize the body of a query.
The proposed optimization algorithm extends the System R dynamic-programming
algorithm by identifying orderings of the n EOB and IOB predicates in a query.
During each iteration of the algorithm, the best intermediate sub-plans are chosen
based on cost and cardinality. In the last iteration, final plans are constructed and
the best plan is selected in terms of the cost metric.

During each iteration i between 2 and n-1, different orderings of the predicates are
analyzed. Two subplans are considered equivalent if and only if they are composed
of the same predicates. A subplan SPi is better than a subplan SPj if and only if
the cost and cardinality of SPj are greater than the cost and cardinality of SPi,
respectively. If the cost of SPi is greater than the cost of SPj, but the cardinality of
SPj is greater than the cardinality of SPi, i.e., they are incomparable, then the
equivalence class is annotated with the two subplans.
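A compact Python sketch of this dynamic-programming search is given below. It is an illustration under several assumptions, not the DOB-S optimizer: only the nested-loop cost formula is used, each predicate carries a single hypothetical reduction factor `rf`, ties are pruned with non-strict comparisons, and the query is assumed to be connected through shared variables. All estimates in the usage example are made up.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple

@dataclass(frozen=True)
class Pred:
    name: str
    vars: FrozenSet[str]
    card: float        # estimated cardinality
    cost_free: float   # estimated cost, no argument bound
    cost_bound: float  # estimated cost, join arguments bound
    rf: float          # assumed reduction factor when joined

@dataclass(frozen=True)
class Plan:
    order: Tuple[str, ...]
    cost: float
    card: float
    vars: FrozenSet[str]

def dominates(a, b):
    # a is at least as good as b in both cost and cardinality
    return a.cost <= b.cost and a.card <= b.card

def optimize(preds):
    # Subplans of size 1, grouped by the set of predicates they contain
    plans = {frozenset([p.name]): [Plan((p.name,), p.cost_free, p.card, p.vars)]
             for p in preds}
    for _ in range(len(preds) - 1):
        new_plans = {}
        for key, subplans in plans.items():
            for sp in subplans:
                for p in preds:
                    # extend only along sideways passing variables
                    if p.name in key or not (sp.vars & p.vars):
                        continue
                    cand = Plan(sp.order + (p.name,),
                                sp.cost + sp.card * p.cost_bound,  # nested-loop cost
                                sp.card * p.card * p.rf,
                                sp.vars | p.vars)
                    bucket = new_plans.setdefault(key | {p.name}, [])
                    if any(dominates(o, cand) for o in bucket):
                        continue
                    # keep incomparable subplans, drop dominated ones
                    bucket[:] = [o for o in bucket if not dominates(cand, o)] + [cand]
        plans = new_plans
    return min((sp for sps in plans.values() for sp in sps), key=lambda sp: sp.cost)

# Hypothetical estimates for the two subgoals of the motivating query
are_classes = Pred("areClasses", frozenset({"C", "O"}), 12.0, 30.0, 3.0, 0.25)
is_dprop = Pred("isDProperty", frozenset({"C"}), 1.0, 1.0, 1.0, 0.25)
best = optimize([are_classes, is_dprop])
```

With these estimates the search places isDProperty first, mirroring the ordering of q' in the motivating example. Each equivalence class (same set of predicates) keeps every subplan that is not dominated, which corresponds to annotating the class with incomparable subplans.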



7 Experimental Results 

An experimental study was conducted for synthetic and real-world ontologies. Ex- 
periments on synthetic ontologies were executed on a SunBlade 150 (650MHz) with 
1GB RAM; experiments on real-world ontologies were executed on a SunFire V440 
(1281MHz) with 16GB RAM. Our system was implemented in SWI-Prolog 5.6.1. 
We have studied three real-world ontologies: Travel (Shell 2002), EHR_RM (Protege
staff 1999), and GALEN (Open Clinical Organization 2001).



Our cost metrics are the number of intermediate facts for synthetic and real-world
ontologies, and the evaluation time for real-world ontologies. In our experiments, 
the sampling parameters d (the error), p (the confidence level), and k (the size of 
the sample for the first stage) were set to 0.2, 0.7 and 7, respectively. We developed 
two sets of experiments according to the evaluation strategies considered: (1) the 
Nested-Loop join evaluation strategy, and (2) the combination of Nested-Loop, 
Block Nested-Loop and Hash join evaluation strategies. Our study consisted of the 
following: 

• Cost Model Predictive Capability: In Figure 2a, we report the correlation
between the estimated values and the actual cost for synthetic ontologies,
considering the Nested-Loop join evaluation strategy. Synthetic ontologies were
randomly generated following a uniform distribution. We generated ten
ontology documents and three chain and star queries with three subgoals for each
ontology; the cost of each ordering was estimated with our cost model, and
each ordering was then evaluated against the ontology; this gives us a total
of six hundred queries. The correlation is 0.92.



Fig. 2. (a) Correlation of estimated cost to actual cost (log. scale) - nested-loop 
join - Synt. ontologies; (b) Correlation of estimated cost to actual cost (log. scale) 
- nested-loop join - GALEN 



Table 7. Correlation values for real-world ontologies

          Nested-Loop Join   Three Evaluation Strategies
Travel    0.96               0.94
EHR_RM    0.98               0.92



In Figure 2b, we report the same correlation metric for the real-world ontology

Fig. 3. (a) #Pred. optimal ordering vs. #Pred. worst ordering - nested-loop-join -
Synt. Ontologies; (b) #Pred. optimal ordering vs. #Pred. median ordering -
nested-loop-join - Synt. Ontologies

Fig. 4. (a) #Pred. optimal ordering vs. #Pred. worst ordering - nested-loop-join
- EHR_RM; (b) #Pred. optimal ordering vs. #Pred. worst ordering - combination
evaluation strategies - EHR_RM



GALEN, and the value is 0.62. In Table 7 we present correlation values for the
real-world ontologies Travel and EHR_RM for our two sets of experiments: the
accuracy of the Nested-Loop join cost model is similar to the accuracy of the
cost model that considers the combination of the three evaluation strategies.
• Cost Improvements: We also conducted experiments to study the cost
improvement obtained with the optimizer. We evaluated all the orderings of each
query, then we ran the optimizer and evaluated the optimized query. Figure 3a
reports the ratio of the cost of the optimal ordering to the cost of the worst
ordering, considering only nested-loop join, for queries against synthetic
ontologies. For synthetic ontologies, this ratio is less than 10% for most
of the queries. We also computed the proportion of the optimal ordering cost
with respect to the median ordering cost. The results for synthetic ontologies
show that the optimal ordering cost is less than 40% of the median for fifteen
of twenty queries; this result can be observed in Figure 3b.
In Figure 4a, we report the ratio of the cost of the optimal ordering to the
cost of the worst ordering considering only nested-loop join for EHR_RM.
Additionally, Figure 4b reports the same metric considering the combination
of the three evaluation strategies. We can observe that the ratio improves
when the combination of the different strategies is considered: for nested-loop
join the mean of this ratio is 0.10, whereas for the combination of strategies
the mean is 0.07; this is because the optimizer searches in a larger space of
possibilities, increasing the chance of finding better query plans.
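
The two measures reported in this study, the correlation between estimated and actual cost on a logarithmic scale and the ratio of the optimal ordering cost to the worst or median ordering cost, can be computed as in the following sketch. It assumes Pearson's coefficient, since the text does not name the coefficient used; the function names are hypothetical, and the numbers in the usage lines are made-up illustrative values, not measurements from our experiments.

```python
import math
from statistics import median
from typing import Sequence, Tuple


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def log_correlation(estimated: Sequence[float],
                    actual: Sequence[float]) -> float:
    """Correlation of estimated vs. actual cost on a log10 scale."""
    return pearson([math.log10(e) for e in estimated],
                   [math.log10(a) for a in actual])


def ordering_ratios(costs: Sequence[float]) -> Tuple[float, float]:
    """Return (optimal / worst, optimal / median) over all orderings of a query."""
    best = min(costs)
    return best / max(costs), best / median(costs)


# Illustrative values only (assumed, not taken from the study).
print(log_correlation([10, 100, 1000], [1, 10, 100]))
print(ordering_ratios([10, 40, 100]))
```

A ratio close to 0 means that the optimal ordering is much cheaper than the ordering it is compared with, whereas a ratio close to 1 means that the optimizer has little room for improvement.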

In general, we may state that the results show a significant improvement in the 
evaluation cost for the optimized queries with respect to the worst-case and median- 
case query orderings. This property holds for synthetic and real-world ontologies. 
However, for synthetic ontologies we notice that for star-shaped queries, the dif- 
ference between the median cost and the optimal cost is very small; this indicates 
that the form of the query may influence the cost improvement achieved by the 
optimizer. 
Fig. 5. Sampling Conjunctions - Query Eval. time and Sample Eval. time vs. #
Inf. Pred.



Finally, we would like to point out that we also studied the use of an adaptive
sampling technique for the cost estimation of the conjunction of two or more
predicates (instead of the System R cost model). Although the sampling technique
gives a correlation similar to that of the combination of sampling and the System R
cost model, the time required to compute the cost estimate may be as large as the
time needed to evaluate the query. In Figure 5 we can observe that the time
difference is marginal.



8 Conclusions and Future Work 

We have developed a cost model that combines System R and adaptive sampling
techniques. Adaptive sampling is used to estimate data that do not exist a priori,
namely the cardinality and cost of intensional rules in the DOB. The experimental
results show that our proposed techniques generally produce a significant
improvement in the evaluation cost for the optimized query.

Currently, we are developing a hybrid optimization mechanism that combines 
Magic Sets and our cost-based technique; the idea is to first identify a good order- 
ing, and then apply Magic Sets rewritings to reduce the program that evaluates 
the query. Initial experiments show that this combined solution outperforms the 
behavior of each individual technique. 

We plan to apply similar optimization techniques for conjunctive queries to DL
ontologies. Initially, we will work on ABox queries, extending the techniques
proposed in (Sirin and Parsia 2006). In a next stage, we will consider mixed TBox
and ABox conjunctive queries.